In [1]:
%pylab inline
In [2]:
import nltk
import matplotlib.pyplot as plt
In [3]:
# Try changing this path to point to a different text file.
raw_text = nltk.load('../data/lorem.txt')
In [4]:
raw_text
Out[4]:
Tokenization is the process of splitting up a text into "tokens". For our purposes, the tokens that we are interested in are usually words.
The simplest approach might be to split the text wherever we find a space. This can work just fine, but it leaves some undesirable artefacts: newlines are kept, and punctuation remains attached to adjacent words.
In [5]:
raw_text.split(' ')
Out[5]:
A better way is to split on any whitespace, and to separate punctuation from adjacent tokens. The nltk package has a tokenizer called the Penn Treebank Tokenizer that does the job.
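As a quick aside, calling split() with no argument splits on any run of whitespace (spaces, tabs, newlines), which removes the newline artefact but still leaves punctuation attached to adjacent words. A minimal sketch, assuming raw_text is loaded as above:
# Splitting on any whitespace drops newlines, but punctuation still clings to words.
print raw_text.split()[:15]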
In [6]:
# Shortcut for Penn Treebank Tokenizer.
from nltk import word_tokenize
In [7]:
tokens = word_tokenize(raw_text)
In [8]:
tokens
Out[8]:
In most of the workflows that we will use in this course, it is typical to represent a document as a list of tokens like the one above.
NLTK provides a representation called Text that offers some helpful tools for exploring your document. It is basically just a wrapper around the list of tokens.
In [9]:
text = nltk.Text(tokens)
We can find all of the occurrences of a token in the document using the concordance() method.
In [10]:
text.concordance('velit')
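Text has a few other exploratory helpers worth knowing about: similar() lists tokens that appear in contexts similar to a given token, and collocations() lists pairs of tokens that frequently occur together. A quick sketch (the output will not be very interesting for lorem-ipsum text):
# Tokens that occur in similar contexts to 'velit'.
text.similar('velit')
# Frequent two-word combinations in the document.
text.collocations()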
Similarly, the dispersion_plot() method shows the relative positions at which specific tokens occur.
In [11]:
text.dispersion_plot(['ipsum', 'orci', 'sem', 'justo', 'arcu', 'Fusce', 'fusce'])
We can plot the frequency of the 20 most common tokens in the text using the plot() method.
In [12]:
text.plot(20)
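If we want the underlying counts rather than a plot, we can build a frequency distribution directly with nltk.FreqDist. A minimal sketch using the same tokens (most_common() assumes NLTK 3; on older versions, freq.items() gives a similar sorted list):
# Count token frequencies and list the 20 most common ones as (token, count) pairs.
freq = nltk.FreqDist(tokens)
print freq.most_common(20)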
Notice in the dispersion plot example above that Fusce and fusce are treated as distinct tokens. The computer does not know that Fusce and fusce are the same word -- one starts with an F and the other starts with an f, which are two entirely different characters.
Normalization is the process of transforming tokens into a standardized representation. The simplest form of normalization is to convert all tokens to lower case, which we can do prior to tokenization using the lower() method.
In [13]:
print raw_text.lower()[:500] # Only show the first 500 characters (0 - 499) of the string.
In [14]:
tokens = word_tokenize(raw_text.lower())
text = nltk.Text(tokens)
In [15]:
print text[:10] # Only show the first 10 tokens (0-9) in the document.
Now notice that the token Fusce (with the upper-case F) does not appear in the document; all occurrences of Fusce have been normalized to fusce (with a lowercase f).
In [16]:
text.dispersion_plot(['Fusce', 'fusce'])
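We can also verify this numerically rather than visually; a plain list count on the tokens is enough:
# After lowercasing, 'Fusce' should not appear at all.
print tokens.count('Fusce'), tokens.count('fusce')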
Stemming is another useful normalization procedure. Sometimes (depending on our research question), distinguishing word forms like "evolutionary" and "evolution" is not desirable. Stemming basically involves chopping off affixes so that only the "root" or "stem" of the word remains.
The NLTK package provides several different stemmers.
In [18]:
raw_text = nltk.load('../data/abstract.txt')
tokens = word_tokenize(raw_text.lower())
print tokens[:50] # Only the first 50 tokens.
Here's an example of normalization with the Porter stemmer, which has been around since the late 1970s:
In [19]:
porter = nltk.PorterStemmer()
print [porter.stem(token) for token in tokens][:50]
...and the Lancaster stemmer:
In [20]:
lancaster = nltk.LancasterStemmer()
print [lancaster.stem(token) for token in tokens][:50]
...and the Snowball stemmer (which is actually a collection of stemmers for a whole bunch of different languages):
In [21]:
snowball = nltk.SnowballStemmer("english")
print [snowball.stem(token) for token in tokens][:50]
An alternate approach to stemming is to lemmatize tokens. Instead of chopping up any old word that it encounters, the WordNet lemmatizer tries to match tokens with known words in the WordNet lexical database, and then convert them to a known "lemma" ("lexicon headword"). If the lemmatizer encounters a token that it does not recognize, it will just leave the token as-is.
In [22]:
wordnet = nltk.WordNetLemmatizer()
print [wordnet.lemmatize(token) for token in tokens][:50]
In [38]:
stemmers = [porter.stem, lancaster.stem, snowball.stem, wordnet.lemmatize]
stemmer_labels = ['original', 'porter', 'lancaster', 'snowball', 'wordnet']
print '\t'.join([t.ljust(8) for t in stemmer_labels])
print '--'*40
for token in tokens[:30]:
    print token.ljust(8),
    for stemmer in stemmers:
        print '\t', stemmer(token).ljust(8),
    print
In [39]:
words = ['sustain', 'sustenance', 'sustaining', 'sustains',
         'sustained', 'sustainable', 'sustainability']
stemmers = [porter.stem, lancaster.stem, snowball.stem, wordnet.lemmatize]
stemmer_labels = ['original', 'porter', 'lancaster', 'snowball', 'wordnet']
print '\t'.join([t.ljust(8) for t in stemmer_labels])
print '--'*40
for token in words:
    print token.ljust(8),
    for stemmer in stemmers:
        print '\t', stemmer(token).ljust(8),
    print
In [48]:
wordnet.lemmatize('knows'), wordnet.lemmatize('sleeps'), wordnet.lemmatize('types')
Out[48]:
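One caveat worth knowing: lemmatize() treats every token as a noun unless you pass a part-of-speech argument, so verb forms are often left untouched. A small sketch:
# With the default (noun) part of speech, 'running' is already a valid noun lemma.
print wordnet.lemmatize('running')
# Telling the lemmatizer the token is a verb gives the verb lemma instead.
print wordnet.lemmatize('running', pos='v')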
Finally, depending on our analysis, we may want to filter out various tokens. For example, we may wish to exclude common words like pronouns. The easiest way to remove these kinds of tokens is by using a stoplist. A stoplist is simply a list of undesirable words.
NLTK provides stoplists containing around 2,400 words from 11 different languages.
In [24]:
from nltk.corpus import stopwords
Here are the English stopwords:
In [25]:
stoplist = stopwords.words('english')
print stoplist
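To see which other languages are available, we can list the fileids of the stopwords corpus:
# Each fileid corresponds to a language-specific stoplist.
print stopwords.fileids()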
We can filter out tokens in the stoplist like so:
In [26]:
print [token for token in tokens if token not in stoplist][:20]
We may also want to remove any punctuation tokens (note the parentheses, above). We can do that using the isalpha() method.
In [27]:
print [token for token in tokens if token.isalpha()][:20]
In [28]:
def normalize_token(token):
    """
    Convert token to lowercase, and stem using the Porter algorithm.

    Parameters
    ----------
    token : str

    Returns
    -------
    token : str
    """
    return porter.stem(token.lower())
In [29]:
def filter_token(token):
    """
    Evaluate whether or not to retain ``token``.

    Parameters
    ----------
    token : str

    Returns
    -------
    keep : bool
    """
    token = token.lower()
    return token not in stoplist and token.isalpha() and len(token) > 3
Let's compare the effect on our token distributions. Here's the top 20 tokens when we tokenize only:
In [30]:
unprocessed_text = nltk.Text(word_tokenize(raw_text))
unprocessed_text.plot(20)
And here's the top 20 tokens when we apply our processing pipeline:
In [31]:
processed_text = nltk.Text([normalize_token(token) for token
                            in word_tokenize(raw_text)
                            if filter_token(token)])
processed_text.plot(20)